[00:00.000 --> 00:05.160]  Hi, I'm Katie Doroschak. I am coming from the University of Washington from the Paul G.
[00:05.160 --> 00:10.180]  Allen School of Computer Science and Engineering with the Molecular Information Systems Lab,
[00:10.180 --> 00:15.180]  where we are really interested in doing interesting things with storing information in DNA and
[00:15.180 --> 00:22.360]  different types of sensing that kind of appeals to the DIY audience a little bit. So the project
[00:22.360 --> 00:26.800]  that I'm going to talk about today is called Porcupine, which is doing rapid and robust
[00:26.800 --> 00:32.160]  tagging of physical objects using DNA with highly separable nanopore signatures, or in
[00:32.280 --> 00:36.800]  a little bit more accessible terms, we tag stuff with DNA and we use nanopores. That's
[00:36.800 --> 00:42.500]  really all we're talking about here. So when we're talking about molecular tags, we are
[00:42.500 --> 00:48.320]  using DNA to identify physical objects. This could be in the same world
[00:48.320 --> 00:53.940]  as QR codes and RFID tags, but essentially we want to be able to tag an object with DNA,
[00:53.940 --> 00:57.380]  ship it or store it somewhere, and then be able to read it back on the other side and
[00:57.380 --> 01:02.540]  say, is this the same object? What's the information in the tag? Or is it completely wrong? And
[01:02.540 --> 01:06.900]  some applications of this include tracking and provenance. So you might have a high value
[01:06.900 --> 01:10.580]  item that you want to be able to make sure it's the same at the beginning and end of
[01:10.580 --> 01:17.720]  the transaction. Secret exchange and counterfeit detection. So maybe you've got a set of pills
[01:17.720 --> 01:22.620]  and you only want to be able to sample a few of them at a time. You could do that with
[01:22.700 --> 01:29.500]  a system like this. And we ended up using the Nanopore MinION device to detect these
[01:29.500 --> 01:35.020]  molecular tags. Some of our system requirements were that we wanted this to be DIY end-to-end
[01:35.020 --> 01:41.040]  by non-experts and not require a full biolab. I myself am a computer scientist and this
[01:41.040 --> 01:46.160]  is something that I can do with maybe a little bit of supervision, so it's not too challenging.
[01:46.420 --> 01:50.800]  And we wanted to be able to generate arbitrary tags on demand without having to do more DNA
[01:50.800 --> 01:56.260]  synthesis. This is one of the most expensive parts of encoding information when using DNA
[01:56.260 --> 02:01.080]  and so if we can do that ahead of time and just be able to copy it instead of having
[02:01.080 --> 02:06.900]  to generate new DNA for every piece of new information, that can save a lot of cost.
[02:07.720 --> 02:12.960]  We wanted to be able to decode quickly and accurately and also use minimal special equipment.
[02:12.960 --> 02:21.240]  So the Nanopore really fits pretty well with this kind of application. One challenge for
[02:21.240 --> 02:24.640]  typical sequencers is that they're often inaccessible to pretty much everybody except
[02:24.640 --> 02:29.820]  for very, very well-funded labs. These cost anywhere from tens to hundreds of thousands
[02:29.820 --> 02:33.880]  of dollars, which doesn't necessarily mean that they're bad, but it's just not very useful
[02:33.880 --> 02:39.060]  for DIY. And they're pretty large, they sit on a big bench top. I'm sure many of you have
[02:39.060 --> 02:44.620]  seen them before, but it's really hard to compare to this candy bar-sized device that
[02:44.620 --> 02:51.200]  can just be plugged into your laptop. And here's a picture of it plugged into a laptop.
[02:51.300 --> 02:56.140]  All the things that you need to run this are very small, like pipettes, a small centrifuge,
[02:56.140 --> 03:02.300]  so it doesn't require the whole lab. And just to briefly run over how this works, I took
[03:02.300 --> 03:09.700]  this diagram from a Nature article. Basically, there is this thin membrane that is present
[03:09.700 --> 03:14.600]  with little nanoscale pores, and there's an ionic current that is being run across this
[03:14.600 --> 03:21.440]  thing, and the current is being measured over time. And then the DNA is prepared such that
[03:21.440 --> 03:27.800]  there is an enzyme that will unwind the DNA for you, and it will flow through this pore.
[03:27.860 --> 03:34.400]  As the DNA gets unwound and is flying through, the current changes a little bit, and so you
[03:34.400 --> 03:39.320]  can go back and tell what DNA was in the pore based on what the current looks like. There's
[03:39.320 --> 03:47.320]  this little diagram in this corner here that shows basically different ionic current traces
[03:47.320 --> 03:52.280]  for each base. Now typically these will all be concatenated together, and will go back up to
[03:52.280 --> 03:58.520]  this open channel state, but it's really just a time series of ionic current measurements instead
[03:58.520 --> 04:07.720]  of actually directly reading the bases. So when we are creating a molecular tag within Porcupine,
[04:07.720 --> 04:14.180]  what we start out with is just like any other digital tag, any RFID tag or QR code or anything,
[04:14.180 --> 04:19.180]  it's basically just digital information, a bunch of bits, ones and zeros. And each different
[04:19.180 --> 04:25.180]  bit is assigned a different type of DNA. So what will happen is you've got physically a little
[04:25.180 --> 04:32.400]  vial of this particular set of molecular bits, as we call them. So when you have a bit that is one,
[04:32.400 --> 04:37.780]  you'll pipette that bit into the molecular tag mixture, and if it's zero, you completely leave
[04:37.780 --> 04:42.400]  it out. So we're really encoding information via presence or absence of these different types of
[04:42.400 --> 04:49.400]  DNA. Then we will apply this tag mixture to an object when dehydrated, ship it or store it
[04:49.400 --> 04:55.140]  somewhere, and then rehydrate the molecular tag using a buffer solution. We'll then load it into
[04:55.140 --> 05:02.580]  the MinION and read it out using some software that we created. When we are actually creating
[05:02.580 --> 05:06.500]  these molecular tags, we're not just encoding the information directly, we've kind of got a step
[05:06.500 --> 05:11.220]  in there that is adding some error correction. So we have a digital tag that's a little bit shorter,
[05:11.220 --> 05:15.900]  a code word that's longer, that has some additional bits in order to add some error correction,
[05:15.900 --> 05:21.440]  which I'll talk about more later, but then we are going to the molecular bits and the molecular tag.
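The presence/absence scheme described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names and the idea of numbering the vials by index are hypothetical and not taken from the Porcupine software.

```python
def molbits_to_pipette(codeword):
    """Return the indices of the molbit vials to add to the tag mixture.

    A bit set to 1 means "pipette that molbit in"; a 0 means "leave it out".
    """
    return [index for index, bit in enumerate(codeword) if bit == 1]


def mixture_to_codeword(present_indices, n_bits):
    """Rebuild the bit string from the set of molbits observed in the mixture."""
    present = set(present_indices)
    return [1 if index in present else 0 for index in range(n_bits)]


# Hypothetical 8-bit code word, just to show the round trip.
example_codeword = [1, 0, 0, 1, 1, 0, 1, 0]
vials = molbits_to_pipette(example_codeword)
recovered = mixture_to_codeword(vials, len(example_codeword))
```

The point of the sketch is that the tag carries no ordering information: the mixture is just a set of molbits, and the bit position comes from which molbit it is.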
[05:23.820 --> 05:30.940]  And a single molecule is made of a unique sequence and a specific length. So this is the
[05:30.940 --> 05:36.740]  part that lets us create more bits without having to synthesize more DNA. So we have
[05:36.740 --> 05:42.920]  our barcode sequence at the very beginning of the strand, then a spacer, and then another barcode
[05:42.920 --> 05:48.980]  sequence and a sequencing adapter. And typically when encoding information in DNA, the information
[05:48.980 --> 05:54.960]  is recorded throughout the entire strand. And so I often get asked, why wouldn't you take advantage
[05:54.960 --> 06:02.360]  of the incredible density of DNA to make this happen? And really this comes down to creating
[06:02.360 --> 06:08.000]  more bits without having to synthesize more DNA. We start out with our nice little well plate of
[06:08.000 --> 06:13.440]  96 sequences, and if we just add two different lengths, we have 192 molbits without ever
[06:13.440 --> 06:22.480]  having to go back to ask IDT for more short strands for us. And then the second part of it
[06:22.480 --> 06:27.760]  is that having this unique identifiable region means that we can avoid base calling. So when we
[06:27.760 --> 06:33.240]  turn this complicated nanopore signal into bases, it takes a very, very long time and it's very
[06:33.240 --> 06:39.560]  complex. And that is the largest source of the error in the pipeline for working with
[06:39.560 --> 06:43.720]  nanopore data, where we actually have a much simpler problem. We're not trying to identify
[06:43.720 --> 06:51.020]  any DNA, just the DNA that we know is already there. So we can turn this into a classification
[06:51.020 --> 06:56.480]  problem instead of a decoding problem. That saves a lot of time and increases our accuracy pretty
[06:56.480 --> 07:05.520]  dramatically. And we design molecular bits to have distinct nanopore signals. We do this using a tool
[07:05.520 --> 07:11.760]  called Scrappie that is produced by Oxford Nanopore itself. So we can give it a sequence, it will
[07:11.760 --> 07:17.360]  produce a theoretical signal, and then this is the actual nanopore signal. And we didn't really
[07:17.360 --> 07:23.040]  cherry-pick one of these. They kind of all look similar, maybe a little bit stretched or narrower,
[07:23.040 --> 07:30.020]  but this allows us to really be able to design these sequences to look different, which makes
[07:30.020 --> 07:36.440]  our problem a lot easier. When we are designing these sequences, we're using an evolutionary
[07:36.440 --> 07:41.620]  process. We're starting out with an initial batch of them. We throw our 96 in our virtual
[07:41.620 --> 07:48.860]  well plate, simulate what they look like, and then compute how different they are. And we will then
[07:48.860 --> 07:55.840]  start an evolutionary process where we shuffle them, mutate them, and then make sure that we're
[07:55.840 --> 08:01.080]  improving things. It's kind of like a guess and check method that will make these look visually
[08:01.080 --> 08:10.540]  different. And basically the lighter colors here are more similar, so they look like similar
[08:10.540 --> 08:15.920]  squiggles. And then you can see that this guy looks very different from this after this full process
[08:15.920 --> 08:21.740]  here. I don't want to pretend like this is the first time anybody has come up with anything that
[08:21.740 --> 08:27.460]  is working with raw nanopore data. There's in particular a group that is working with
[08:27.460 --> 08:34.760]  demultiplexing. And for anybody unfamiliar, multiplexing is a tool that is used to add
[08:34.760 --> 08:41.600]  barcodes on to a sample so that any reads that you get back out on the other side can be associated
[08:41.600 --> 08:47.300]  with a particular sample. And then at the end you can go separate things out and make it so that you
[08:47.300 --> 08:53.200]  are only working on your one sample at a time. And what they've done in this case is taken
[08:53.860 --> 09:00.000]  the barcodes that Oxford Nanopore has produced for multiplexing, they found a subset of them,
[09:00.000 --> 09:07.800]  and then can identify those using the raw nanopore signal. One challenge that they have is that they
[09:07.800 --> 09:13.360]  didn't have the ability to design them specifically to identify them later, meaning that they are
[09:13.360 --> 09:19.240]  working with a much more challenging problem than we are. But it's a pretty cool tool and
[09:19.240 --> 09:26.940]  it's been very useful for folks. And then another one: currently, you can't just go buy
[09:28.220 --> 09:32.420]  multiplexing barcodes for RNA. And so there's a group that developed four barcodes that they
[09:32.420 --> 09:36.620]  could then identify, which is really similar to what we're doing. We kind of developed it
[09:36.620 --> 09:39.560]  independently, but it is a similar process.
[09:42.060 --> 09:46.600]  And the way that all of these work is by classifying them similarly. We're
[09:46.600 --> 09:53.240]  trying to just identify what little barcode is present in the individual reads. For our training
[09:53.240 --> 09:57.780]  data, we label the squiggles using sequencing data and then spread all the bits across a bunch
[09:57.780 --> 10:04.420]  of different runs, then test data with half of the bits, and we ended up using a five-layer CNN with
[10:04.700 --> 10:08.580]  a fully connected layer and softmax. And this is all stuff that I'm happy to answer more questions
[10:08.580 --> 10:17.960]  about in greater detail later if desired. Our classification accuracy is very high. And the
[10:17.960 --> 10:22.260]  only reason that I actually say this is to show you that identifying molbits is a totally
[10:22.260 --> 10:28.900]  non-issue. Our training accuracy is like 99.9 something, validation 97.7. It's kind of to the
[10:28.900 --> 10:35.780]  point where it's not really an issue for us to be able to identify the molbits. However,
[10:36.920 --> 10:40.660]  identifying what the tag is based on the molbits can be a little bit more challenging.
[10:41.780 --> 10:46.580]  Now that we have our sandbox, we've got our molecular bits and we have a way to read them,
[10:46.580 --> 10:52.480]  the question is what should we encode and how should we encode it? We have our 96 bits and we
[10:52.480 --> 10:57.880]  have kind of our framework where we might want to expand our digital tag, but we have to think about
[10:57.880 --> 11:04.180]  how we might want to do that. The most naive encoding scheme of course is mapping one digital
[11:04.180 --> 11:09.120]  bit to one molecular bit. And this is of course not ideal because one bit error means that you
[11:09.120 --> 11:14.600]  completely get the entire tag wrong. And so we want to add some error correction as alluded to
[11:14.600 --> 11:20.440]  previously. What we'll do is we'll reserve some bits for the tag and then use the rest of them
[11:20.440 --> 11:26.820]  to correct errors. And in our case we picked a 32-bit message and then multiplied by a 32 by 96
[11:26.820 --> 11:32.540]  random generator matrix to produce a code word which then gets put in the molecular tag.
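As a rough sketch of the encoding step just described, the following Python multiplies a 32-bit message by a random 32 by 96 binary matrix. Several things here are assumptions rather than details from the talk: the arithmetic is taken to be modulo 2 (as is usual for binary linear codes), the matrix is drawn from a seeded pseudo-random generator, and the ASCII string "MISL" is just a convenient 32-bit example message.

```python
import random

K, N = 32, 96  # message length and code word length, as stated in the talk


def random_generator_matrix(k=K, n=N, seed=0):
    """Draw a k-by-n matrix of random bits (the seed is an arbitrary choice)."""
    rng = random.Random(seed)
    return [[rng.randint(0, 1) for _ in range(n)] for _ in range(k)]


def encode(message_bits, generator):
    """Multiply the message row vector by the generator matrix modulo 2.

    Over bits this is the XOR of the matrix rows selected by the 1-bits.
    """
    codeword = [0] * len(generator[0])
    for bit, row in zip(message_bits, generator):
        if bit:
            codeword = [c ^ r for c, r in zip(codeword, row)]
    return codeword


def text_to_bits(text):
    """Turn ASCII text into a list of bits, most significant bit first."""
    return [(byte >> shift) & 1
            for byte in text.encode("ascii")
            for shift in range(7, -1, -1)]


G = random_generator_matrix()
message = text_to_bits("MISL")   # 4 characters -> 32 bits
codeword = encode(message, G)    # 96 bits, one per molbit
```

Each of the 96 code word bits would then map to one molbit in the tag mixture. Other codes could be dropped in at this step without changing anything downstream.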
[11:32.860 --> 11:38.640]  And this allows us to get up to 18 bits wrong, which is an enormous amount without ever worrying
[11:38.640 --> 11:43.400]  about whether we're going to get the tag wrong. This is something that can also be chosen depending
[11:43.400 --> 11:49.080]  on the application. It's just an example that we have for basically our paper that we wrote.
[11:49.080 --> 11:53.820]  But if you wanted to take these 96 bits and use them in another way, there's nothing that
[11:53.820 --> 12:01.760]  would stop anybody from doing that. So just to briefly walk through encoding an actual message
[12:01.760 --> 12:06.900]  here because it's a little bit hard to kind of connect the two. We'll start out with our digital
[12:06.900 --> 12:11.820]  tag here. We had to encode our Molecular Information Systems Lab
[12:11.820 --> 12:17.920]  of course. And then we added our bits for the code word. Then we'll go through the process
[12:17.920 --> 12:23.320]  of actually encoding and sequencing and doing all the wet lab stuff. And then what we'll get back
[12:23.320 --> 12:29.680]  out of the sequencing machine is a set of read counts. So how many times did we observe
[12:29.680 --> 12:35.500]  that particular molecular bit? And then we have to decide like at what point do we want to set it to
[12:35.500 --> 12:40.800]  one, and at what point do we want to set it to zero? So we have a threshold here where anything
[12:40.800 --> 12:46.320]  that is above this line we'll set to one and below it is zero. And then we get a few incorrect bits.
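The thresholding step can be sketched as follows. The read counts, the threshold value, and the optional per-molbit scale factors are all made-up illustrations, not numbers from the talk or the paper.

```python
def counts_to_bits(read_counts, threshold, scale=None):
    """Call each molbit 1 if its (optionally rescaled) read count is above the threshold.

    `scale` is a hypothetical list of per-molbit correction factors; leaving it
    as None applies the raw threshold to the raw counts.
    """
    if scale is None:
        scale = [1.0] * len(read_counts)
    return [1 if count * factor > threshold else 0
            for count, factor in zip(read_counts, scale)]


def hamming_distance(a, b):
    """Count the positions where two equal-length bit lists differ."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical example: the third molbit reads low, so the raw threshold misses it.
expected = [1, 0, 1, 0]
counts = [120, 3, 8, 0]
raw_bits = counts_to_bits(counts, threshold=10)
rescaled_bits = counts_to_bits(counts, threshold=10, scale=[1.0, 1.0, 2.0, 1.0])
raw_errors = hamming_distance(raw_bits, expected)
rescaled_errors = hamming_distance(rescaled_bits, expected)
```

In this toy case the raw threshold gives one bit error and the rescaled counts give none, which is the kind of effect a reproducible per-molbit bias would allow.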
[12:46.320 --> 12:53.180]  This is okay. We have fewer than our 18, so we'll still decode correctly. But we also realize
[12:53.180 --> 12:58.880]  that there's kind of a large variation in read counts here. And we found that this was reproducible.
[12:58.900 --> 13:03.100]  Still cannot figure out why. I'm more than happy to talk to people about why this might be happening
[13:03.100 --> 13:09.100]  because we've tried a million things. But it's reproducible so we're able to rescale the counts
[13:09.100 --> 13:15.800]  and essentially what happens is that we get no bit errors after doing that. And we recover our
[13:15.800 --> 13:24.820]  final decoded message here. So for our final results, we are talking about how long it takes to
[13:24.820 --> 13:30.580]  actually decode a message. Do you need to run this for hours? Overnight? Seconds? Minutes? And so what
[13:30.580 --> 13:36.060]  we have here is on the x-axis we've got our sequencing runtime in seconds on a log scale.
[13:36.060 --> 13:42.300]  And then we've got our code word distance over here. So you can imagine that in reality you
[13:42.300 --> 13:50.160]  could possibly get all 96 bits wrong. However, because we have used error correcting codes,
[13:50.160 --> 13:58.060]  the minimum distance between all of them is 18 bits. So if you get 19 bits wrong,
[13:58.060 --> 14:03.480]  in some cases you'll actually get a totally different code word. So in this context with
[14:03.480 --> 14:08.880]  the error correction, the maximum number of bits can be anywhere from like 18 to 20-ish.
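To make "code word distance" concrete, here is a toy sketch of minimum distance and nearest-code-word decoding. The four 6-bit code words below are invented for illustration and are not the 32-to-96-bit code from the talk; a real decoder also could not simply enumerate all 2^32 code words like this, and the talk does not say how decoding is implemented.

```python
from itertools import combinations


def hamming_distance(a, b):
    """Count the positions where two equal-length bit lists differ."""
    return sum(x != y for x, y in zip(a, b))


def minimum_distance(codebook):
    """Smallest Hamming distance between any two distinct code words."""
    return min(hamming_distance(a, b)
               for a, b in combinations(codebook.values(), 2))


def decode_nearest(received, codebook):
    """Return the message whose code word is closest to the received bits."""
    return min(codebook, key=lambda msg: hamming_distance(received, codebook[msg]))


# Toy codebook (hypothetical): 2-bit messages mapped to 6-bit code words.
CODEBOOK = {
    "00": [0, 0, 0, 0, 0, 0],
    "01": [0, 0, 1, 1, 1, 1],
    "10": [1, 1, 1, 1, 0, 0],
    "11": [1, 1, 0, 0, 1, 1],
}

d_min = minimum_distance(CODEBOOK)
received = [0, 0, 1, 1, 1, 0]            # code word for "01" with its last bit flipped
decoded = decode_nearest(received, CODEBOOK)
```

The larger the minimum distance, the more bit errors a read-out can contain before it lands closer to some other code word.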
[14:09.860 --> 14:15.880]  And so what we have here is, as expected,
[14:15.880 --> 14:20.020]  as your sequencing runtime increases, the chance of you getting the message wrong goes
[14:20.020 --> 14:24.720]  down because you've observed enough reads to be really confident. Each x here is an incorrect
[14:24.720 --> 14:30.400]  decoding and then our dashed line is guaranteed correct decoding with error correction.
[14:32.360 --> 14:38.680]  And we are able to decode with only about 10 seconds of data, which is really nice because
[14:38.680 --> 14:43.220]  with nanopore sequencing you can just stop at any point. You don't have to keep running this
[14:43.220 --> 14:47.260]  for hours and hours because the process doesn't require you to do that. It's just physical
[14:47.980 --> 14:56.300]  strands flowing through this membrane. And what this means is that we can do pretty close to
[14:56.300 --> 15:03.560]  real-time reading. Another cool thing is that we've made the molecular tags shelf-stable.
[15:03.560 --> 15:07.320]  This is really important for basically any kind of application you can think of where
[15:07.320 --> 15:12.080]  you'd want to tag something for more than a few minutes. You want to be able to ship or store
[15:12.080 --> 15:18.160]  your object, but then afterwards we can rehydrate it and then read it, and that has been a crucial
[15:18.160 --> 15:23.660]  part of this. So we'll prepare this tag for sequencing immediately after assembling the tag.
[15:23.720 --> 15:29.060]  This is a step that takes about one to two hours, so we are front-loading all the lab work
[15:29.540 --> 15:33.940]  on the writing side, meaning when you go to read it you don't have to spend
[15:33.940 --> 15:41.640]  hardly any time at all, like seconds. We also sent a tag in the mail and just through regular
[15:41.640 --> 15:46.860]  mail like USPS to California and we could recover everything after about four weeks.
[15:46.900 --> 15:52.920]  I'm not sure what the upper bound is on how long these things will last, but we basically
[15:52.920 --> 15:59.320]  got the same sample back that we sent. So I think there's a lot of work to be done
[15:59.320 --> 16:07.960]  in figuring out what are the bounds on what kind of surfaces these things can be attached
[16:07.960 --> 16:13.260]  to and how long they will last, but initial results are kind of promising for these being
[16:13.260 --> 16:18.620]  actually stable. And then because we're attaching the sequencing adapter ahead of time, the part
[16:18.620 --> 16:23.900]  that's actually doing the unwinding of the DNA, you can't get any contamination. It's just not
[16:23.900 --> 16:29.820]  possible to happen after that process is completed because you have to have that adapter in order for
[16:29.820 --> 16:35.860]  it to be read. And so putting it on a surface means that you could have someone's like environmental
[16:35.860 --> 16:44.000]  DNA or anything and it will still mostly read out fine. Another big question I get is like,
[16:44.000 --> 16:50.740]  okay cool, but what can we actually use this for? And we haven't really explored the extent of what
[16:50.740 --> 16:55.920]  kinds of surfaces or anything that this could be used on, but we can think of things that are
[16:55.920 --> 17:02.680]  traditionally difficult to tag with QR codes and RFID codes and such. This might include things
[17:02.680 --> 17:08.220]  like liquids, maybe food with some safety testing. We make no claims about it actually being safe
[17:08.220 --> 17:16.460]  for food without testing. Paper and commodities. So maybe you have a supply chain where you want
[17:16.460 --> 17:21.740]  to tag everything and then only read back a few of them. You can really amortize the cost
[17:21.740 --> 17:28.340]  because reading is going to be the highest cost in this process. And there's some prior work that
[17:28.340 --> 17:34.180]  demonstrates some DNA encapsulation methods. These methods add some time and effort to the process,
[17:34.180 --> 17:40.080]  but they could be used to further extend the life past what this would normally live on its own for.
[17:41.560 --> 17:48.540]  So in summary, we have our molecular tagging system that uses DNA to tag physical objects.
[17:49.000 --> 17:54.180]  The design uses an evolutionary model for nanopore orthogonal sequences, making them
[17:54.180 --> 17:58.880]  look visually as different as possible and essentially trying to make our classification
[17:58.880 --> 18:06.140]  problem easier. And then we classify using a CNN. We will encode and decode using our
[18:06.140 --> 18:12.580]  random generator matrix, but again, the system is agnostic to any type of encoding that you want to use.
[18:12.580 --> 18:18.060]  And we can get readout with less than 10 seconds of data. Future work might be like using a
[18:18.060 --> 18:23.180]  generative model to design these instead of an evolutionary model that's a guess and check.
[18:23.820 --> 18:28.580]  And also considering these different kinds of encoding and decoding methods that might be
[18:28.580 --> 18:34.780]  able to take advantage of different parts of the known error in the system, which we haven't really
[18:34.780 --> 18:40.980]  done at all. Another thing to consider, especially because this is a security-minded conference,
[18:40.980 --> 18:49.760]  is what kind of security this actually provides. And one nice thing is that DNA is invisible.
[18:49.760 --> 18:56.200]  However, security by obfuscation is not always the right kind of security. It can be good in
[18:56.200 --> 19:00.940]  some circumstances, but you're never going to fool anybody who's really dedicated to
[19:00.940 --> 19:07.020]  getting access to this just by making it invisible. And so there are applications where
[19:07.020 --> 19:11.920]  this would be useful and maybe not, and I'm really curious if anybody has thoughts on
[19:13.180 --> 19:19.980]  maybe this sparked some kind of interesting application that we haven't thought of yet.
[19:21.260 --> 19:26.340]  I also just wanted to give a really quick plug for the other things that our lab is doing,
[19:26.340 --> 19:30.540]  because I think it's really cool and I only work on a very small segment of this.
[19:30.940 --> 19:35.500]  Our lab works primarily on something called DNA data storage, where we're trying to store
[19:35.500 --> 19:42.100]  information in DNA and read it back later, but not in terms of presence or absence,
[19:42.100 --> 19:49.420]  but actually encoding information in the bases of the DNA. And this has been pretty far developed,
[19:50.060 --> 19:58.320]  but still, the writing aspect is very expensive. The cheapest I've ever seen DNA for is like
[19:58.320 --> 20:02.560]  seven cents per base, which is astronomical when you consider large volumes of DNA.
[20:03.240 --> 20:08.640]  And that's been a pretty large barrier, but it's part of what I think is a really cool
[20:08.640 --> 20:14.560]  idea. DNA security. So this is part of our lab called CyBioSecurity. One paper that
[20:14.560 --> 20:22.680]  came out recently was talking about the GenBank system and some of the security implications.
[20:22.680 --> 20:28.860]  We also have microfluidic automation, so how do you abstract away some of the tedious parts
[20:28.860 --> 20:36.780]  of working in the lab, and then DNA circuits. And of course, nanopore sensing is what I personally
[20:36.780 --> 20:44.940]  work on. And with that, I will wrap up and I'm happy to take any questions in the live Q&A after
[20:44.940 --> 20:48.140]  the session. Thank you.
